Anik Chakraborty

waytoanik@outlook.com

Problem Statement:

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual value and flip them at a higher price. For this purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below. The company is looking at prospective properties to buy in order to enter the market. You are required to build a regression model using regularisation to predict the actual value of the prospective properties and decide whether to invest in them or not. The company wants to know the following things about the prospective properties:


Reading and Understanding the Data


Basic Data Cleanup


Missing value check and imputation using Business Logic

'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1', 'BsmtCond', 'BsmtQual'

So here, we will replace NaN values for the above attributes with 'Not Present'.
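This replacement can be sketched as below; the two-column frame is a hypothetical stand-in for the housing CSV (in the notebook the list covers all 14 columns above).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the housing data.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan],
})

# For these columns NaN means "no pool / no fence", i.e. it is a level of
# the category, not a missing value, so we fill it with an explicit label.
absent_cols = ["PoolQC", "Fence"]
df[absent_cols] = df[absent_cols].fillna("Not Present")
```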

LotFrontage, GarageYrBlt, MasVnrArea, MasVnrType, Electrical

Initially GarageYrBlt and GarageType both had 5.55% missing values. After imputing the NaN values of GarageType with 'Not Available', we can see that GarageYrBlt is NaN only for those observations where GarageType is 'Not Available'. We can conclude that if a garage is not available, there will be no GarageYrBlt value for it. So we can safely impute the GarageYrBlt NaN values with 0.
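The check-then-fill logic can be sketched as follows, using hypothetical rows that mirror the pattern found in the data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: GarageYrBlt is NaN exactly where there is no garage.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1995.0, np.nan],
})

df["GarageType"] = df["GarageType"].fillna("Not Available")

# Sanity check: every remaining NaN in GarageYrBlt belongs to a house
# without a garage, so 0 is a safe fill value.
no_garage = df["GarageType"].eq("Not Available")
assert df.loc[~no_garage, "GarageYrBlt"].notna().all()
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(0)
```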

I'll perform statistical imputation for the rest of the columns after the train-test split: LotFrontage, MasVnrArea, MasVnrType, Electrical

Changing data types

MSSubClass ("identifies the type of dwelling involved in the sale") is a categorical variable, but it appears as a numeric variable.
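A minimal sketch of the type change, on a hypothetical frame; casting the codes to strings makes downstream encoders treat the column as categorical:

```python
import pandas as pd

# MSSubClass values are dwelling-type codes, not quantities (toy frame).
df = pd.DataFrame({"MSSubClass": [20, 60, 120]})
df["MSSubClass"] = df["MSSubClass"].astype(str)
```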

Exploratory Data Analysis


There are 81 attributes in the dataset, so I am running SweetViz AutoEDA to explore and visualize the data. Then I'll manually explore the attributes that have a high correlation coefficient with the target variable.

AutoEDA using SweetViz

Observations from AutoEDA

Numerical Associations with SalePrice:

Categorical Associations with SalePrice:

Visualizing numeric variables:

Visualizing categorical variables:

Inferences

Correlation Heatmap

The features below have very high correlation coefficients.

Data Preparation


Earlier we saw that our target variable SalePrice is heavily right-skewed. We can perform a log transformation to remove the skewness, which will help boost model performance.

Transforming the Target variable

It can be seen that after the log transformation, SalePrice now has a near-normal distribution.
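One common way to do this transformation is `np.log1p` (the notebook may equally use plain `np.log`, inverted with `np.exp`); the prices below are hypothetical stand-ins for the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed SalePrice values.
df = pd.DataFrame({"SalePrice": [100000.0, 150000.0, 200000.0, 750000.0]})

df["SalePrice_log"] = np.log1p(df["SalePrice"])  # log(1 + x)
# Predictions made on the log scale are inverted with expm1 for reporting.
recovered = np.expm1(df["SalePrice_log"])
```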

Now, dropping SalePrice, as we have created a log-transformed version of it. Also dropping the Id column, as it will not help in prediction.

Dropping unnecessary columns

Train-Test split

Statistical imputation of missing values

Imputing the remaining features in the train and test datasets using the median (for continuous variables) and mode (for categorical variables) calculated on the train dataset.
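The leakage-free pattern described above can be sketched like this; the small frame and split parameters are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for the cleaned dataset.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan, 60.0],
    "Electrical": ["SBrkr", "FuseA", np.nan, "SBrkr", "SBrkr", np.nan],
})
train, test = train_test_split(df, test_size=0.33, random_state=42)
train, test = train.copy(), test.copy()

# Fill statistics are computed on the train split only, so no information
# leaks from the test set into the preprocessing.
fill = {
    "LotFrontage": train["LotFrontage"].median(),
    "Electrical": train["Electrical"].mode()[0],
}
train = train.fillna(fill)
test = test.fillna(fill)
```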

Encoding categorical (nominal) features

For the rest of the categorical (nominal) columns, one-hot encoding will be used.
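One way to sketch this is with `pd.get_dummies`, aligning the test columns to the train columns so unseen levels don't break the model matrix (the column name and levels here are hypothetical; the notebook may instead use scikit-learn's `OneHotEncoder`):

```python
import pandas as pd

train = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
test = pd.DataFrame({"MSZoning": ["RM", "FV"]})  # 'FV' unseen in train

train_ohe = pd.get_dummies(train, columns=["MSZoning"])
test_ohe = pd.get_dummies(test, columns=["MSZoning"])
# Align test to the train columns: unseen levels are dropped and columns
# absent from test are filled with 0.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
```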

Scaling numeric features

During EDA we observed a few outliers in the numeric features. So, we use robust scaling (based on the median and quantile values) instead of standard scaling (based on the mean and standard deviation).
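A minimal sketch with scikit-learn's `RobustScaler`, on a hypothetical single feature containing one outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One numeric feature with an outlier at 300 (hypothetical values).
X_train = np.array([[50.0], [60.0], [65.0], [70.0], [300.0]])

# RobustScaler centers on the median and scales by the IQR, so the single
# outlier barely affects the fitted statistics.
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
```

Here the median (65) maps to 0 and the IQR (70 − 60 = 10) becomes the unit, so the outlier stays visible but no longer distorts the scale of the other points.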

Variance Thresholding

During EDA, we saw that there are a few categorical features where only a handful of observations differ from a constant value. Removing those categorical features having zero or near-zero variance.

It can be seen that Functional_Sev, i.e. Functional with the 'Sev' level, has only one observation in the entire dataset.
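This filtering can be sketched with scikit-learn's `VarianceThreshold`; the synthetic columns and the 0.01 cutoff are illustrative assumptions, with the rare column mimicking a dummy like Functional_Sev:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
informative = rng.integers(0, 2, size=(100, 1))  # roughly balanced dummy
rare = np.zeros((100, 1))
rare[0] = 1  # like Functional_Sev: a single 1 in the whole dataset

X = np.hstack([informative, rare]).astype(float)
# A dummy with one positive row in 100 has variance 0.01 * 0.99 = 0.0099,
# which falls below the threshold and gets dropped.
sel = VarianceThreshold(threshold=0.01)
X_reduced = sel.fit_transform(X)
```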

Model Building


Ridge Regression

The optimal value for alpha is 8.
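The alpha search can be sketched with `GridSearchCV`; the synthetic data, alpha grid, and scoring metric below are illustrative assumptions, not the notebook's actual setup:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in (hypothetical) for the scaled training matrix.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0,
                       random_state=42)

params = {"alpha": [0.1, 1, 2, 4, 8, 16, 32]}
grid = GridSearchCV(Ridge(), params, cv=5,
                    scoring="neg_mean_absolute_error")
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```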

Lasso Regression

The optimal value for alpha is 0.0001. Next, I'll try to fine-tune this value by running GridSearchCV with values close to 0.0001.

So, for Lasso we get an optimal alpha of 0.0006.
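The fine-tuning step can be sketched as a second grid over alphas near the first optimum; again the synthetic data and exact grid are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in (hypothetical) for the scaled training data.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0,
                       random_state=42)
y = y / y.std()  # roughly the scale of a log-transformed target

# A finer grid around the coarse optimum of 0.0001.
params = {"alpha": [0.0001, 0.0002, 0.0004, 0.0006, 0.0008, 0.001]}
grid = GridSearchCV(Lasso(max_iter=50000), params, cv=5,
                    scoring="neg_mean_squared_error")
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```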

Conclusion




Assignment II


Scenario 1: Doubling the value of the optimal alpha
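The effect of doubling alpha can be sketched by refitting the model at both values and comparing coefficients; the synthetic data and the base alpha of 0.5 are illustrative assumptions (in the assignment the tuned alpha and the real train set are used):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in (hypothetical) for the training data.
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

alpha = 0.5
lasso = Lasso(alpha=alpha, max_iter=50000).fit(X, y)
lasso_2x = Lasso(alpha=2 * alpha, max_iter=50000).fit(X, y)

# A stronger penalty shrinks the coefficient vector: its L1 norm can only
# stay the same or decrease, and typically more coefficients hit zero.
l1 = np.abs(lasso.coef_).sum()
l1_2x = np.abs(lasso_2x.coef_).sum()
```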

Scenario 2: The 5 most important predictor variables in the lasso model are not available in the incoming data

As Neighborhood_StoneBr is a dummy variable, we'll drop the entire Neighborhood feature.
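Dropping the whole feature rather than one dummy level can be sketched as below, on a hypothetical model matrix; removing only Neighborhood_StoneBr would still leak the feature through its sibling dummies:

```python
import pandas as pd

# Toy model matrix (hypothetical columns) after one-hot encoding.
df = pd.DataFrame({
    "Neighborhood_StoneBr": [1, 0, 0],
    "Neighborhood_NridgHt": [0, 1, 0],
    "GrLivArea": [1500, 2000, 1750],
})

# Drop every dummy derived from Neighborhood, not just StoneBr.
dummy_cols = [c for c in df.columns if c.startswith("Neighborhood_")]
df = df.drop(columns=dummy_cols)
```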